In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
sns.set_context("talk")
sns.set_style("ticks")
np.random.seed(1337)

I will be writing about extreme value theory (EVT), which was introduced to me by my brother Sudhanshu while he was working on his internship project. I really liked the connection it has with the central limit theorem (CLT). It helped me see the CLT as a way to identify the distribution of a function applied to independent and identically distributed (i.i.d.) samples: for the CLT the function is the mean (or sum) of the samples, while for EVT it is the max or min.

This will be a multi-part series explaining EVT using examples from real data. In this first part, the focus is on understanding the core ideas and introducing the limiting distributions of EVT.

Most of us know about the central limit theorem. It states that the mean of i.i.d. samples from a distribution with finite variance is approximately normally distributed, with the approximation improving as the sample size grows. More formally, we can write it as follows:

$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ S_n = \frac{\sum_i X_i}{n}\\ S_n \sim \mathcal{N}(\mu, \sigma^2/n) \quad \text{(approximately, for large } n\text{)}\\ \end{equation} $$
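
As a quick sanity check of this statement (a small simulation sketch, not part of the original analysis), we can draw many samples of increasing size $n$, compute their means, and watch the spread of the means shrink like $\sigma/\sqrt{n}$ while staying centred on $\mu$:

for n in [10, 100, 1000]:
    # 10,000 independent samples, each of size n, from an exponential distribution (mu=1, sigma=1)
    means = stats.expon().rvs(size=(10000, n)).mean(axis=1)
    print(n, means.mean().round(3), means.std().round(3), round(1 / np.sqrt(n), 3))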

Similarly, there is extreme value theory, which deals with the distribution of the extreme values (the minimum or the maximum) of i.i.d. samples from some distribution.

$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ M_n = \max_i X_i\\ M_n \sim {\textrm {GEV}}(\mu ,\,\sigma ,\,\xi ) \\ \end{equation} $$

Surprisingly, just as with the central limit theorem, these extreme values also converge to a family of distributions, the generalized extreme value (GEV) distribution, which covers three types of limiting laws. These are:

Weibull law: $G(z)={\begin{cases}\exp \left\{-\left(-\left({\frac {z-b}{a}}\right)\right)^{\alpha }\right\}&z<b\\1&z\geq b\end{cases}}$ when the distribution of $M_{n}$ has a light tail with finite upper bound. Also known as Type 3.

Gumbel law: $G(z)=\exp \left\{-\exp \left(-\left({\frac {z-b}{a}}\right)\right)\right\}{\text{ for }}z\in \mathbb {R}$, when the distribution of $M_{n}$ has an exponential tail. Also known as Type 1.

Fréchet Law: $G(z)={\begin{cases}0&z\leq b\\\exp \left\{-\left({\frac {z-b}{a}}\right)^{-\alpha }\right\}&z>b.\end{cases}}$ when the distribution of $M_{n}$ has a heavy tail (including polynomial decay). Also known as Type 2.

In all cases, $\alpha >0$. [Source: https://en.wikipedia.org/wiki/Extreme_value_theory#Univariate_theory ]
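
To make this concrete (a small simulation sketch, not part of the original analysis): the exponential distribution has an exponential tail, so the maxima of exponential samples should follow a Gumbel law. Fitting scipy's genextreme to simulated maxima should therefore recover a shape parameter close to zero; note that scipy parameterizes the GEV with $c = -\xi$.

# maxima of 2000 independent exponential samples, each of size 1000
maxima = stats.expon().rvs(size=(2000, 1000)).max(axis=1)

c, loc, scale = stats.genextreme.fit(maxima)
print(f"c={c:.3f} (close to 0 => Gumbel), loc={loc:.2f}, scale={scale:.2f}")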

The good thing about knowing the distribution of $M_n$ is that we can quantify the probability of observing a value as extreme as $M_n$. For the maximum, this can be computed using the cumulative distribution function (CDF) as $P(M_n < x)$. If we set a threshold on this probability (say $\delta = 99\%$), we can do two things: build systems that are robust to the extreme value $\delta$ percent of the time, and flag any observation whose CDF value exceeds $\delta$ as a very rare event.
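
As a minimal sketch of this thresholding idea (the GEV parameters below are made up purely for illustration), we can move between a probability threshold and the corresponding extreme value using the cdf and ppf methods of a fitted distribution:

# a hypothetical fitted GEV for the maxima (parameters chosen only for illustration)
gev = stats.genextreme(c=0.0, loc=10.0, scale=2.0)

delta = 0.99
threshold = gev.ppf(delta)  # the value that 99% of maxima are expected to stay below
print(f"99% of maxima expected below {threshold:.2f}")

observed = 20.0             # a hypothetical new observation
if gev.cdf(observed) > delta:
    print("rare event: beyond the 99% threshold")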

Let us see this in action. First we are going to create a draw_samples function which allows us to draw samples from the various distributions available in scipy.stats. We will draw $n$ samples, each consisting of $k$ i.i.d. draws, and later apply an aggregation (mean, max, or min) across the $k$ draws within each sample.


In [68]:
def draw_samples(dist, n, k=1):
    # draw n samples, each of size k, from the given scipy.stats distribution
    return dist.rvs(size=(n, k))

def plot_samples(dists, aggregations, n=1000, k_max=10000):
    rows, cols = len(aggregations), len(dists)
    fig, ax = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    fig_cum, ax_cum = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    for i, (dist_name, dist) in enumerate(dists):
        samples = draw_samples(dist, n, k_max)
        dist_mean = dist.mean()
        dist_std = dist.std()
        for j, (agg_name, agg) in enumerate(aggregations):
            # grow the per-sample size k on a log scale: 10, 100, ..., k_max
            for k in [int(10**p) for p in np.arange(1, np.log10(k_max)+1)]:
                # apply the aggregation (mean/max/min) across the first k draws of each sample
                X = agg(samples[:, :k], axis=1)
                ax[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=False,
                    histtype="step", 
                    label=f"$k={k}$"
                )
                ax_cum[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=True,
                    histtype="step", 
                    label=f"$k={k}$"
                )
            if j == 0:
                ax[j, i].set_title(dist_name)
                ax_cum[j, i].set_title(dist_name)
            if i == 0:
                ax[j, i].set_ylabel(f"$p({agg_name}[X_k])$")
                ax_cum[j, i].set_ylabel(f"$p({agg_name}[X_k] < x)$")
            if j == rows-1 and i == cols-1:
                ax[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
                ax_cum[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
    sns.despine(fig=fig, offset=10)
    sns.despine(fig=fig_cum, offset=10)
    fig.tight_layout()
    fig_cum.tight_layout()
    return (fig, ax), (fig_cum, ax_cum)

In [69]:
samples = draw_samples(stats.norm(), 100, 10000)
samples.shape


Out[69]:
(100, 10000)

Let us compare the distribution of the mean, max, and min of samples from a few distributions, e.g. uniform, normal, exponential, logistic, and powerlaw. We are going to plot the PDF as well as the CDF of the distributions. The PDF tells us the neighborhood of high probability density, while the CDF gives us a way to bound a certain percentage of the data.


In [71]:
dists = [
    ("uniform", stats.uniform()),
    ("norm", stats.norm()),
    ("expon ", stats.expon()),
    ("logistic", stats.logistic()),
    ("powerlaw", stats.powerlaw(a=5)),
]

aggregations = [
    ("mean", np.mean),
    ("max", np.max),
    ("min", np.min),
]
plot_samples(dists, aggregations, n=1000, k_max=10000);


As we can observe, the distributions of both the mean and the extreme values become more concentrated as the sample size $k$ increases. However, unlike the central limit theorem, where the distribution of the estimated mean stays centred on the true mean and only its variance shrinks for larger $k$, the extreme value distributions shift to the right (for max) or to the left (for min) as $k$ increases. The variance of the extreme value distributions also decreases as $k$ increases.

For distributions with bounded support, like the uniform and the powerlaw, the extreme value converges to the corresponding bound as $k$ increases; the exponential is bounded below at zero, so its minimum converges to that bound.
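
For example, for $X_i \sim \mathrm{Uniform}(0, 1)$ we can write the CDF of the maximum directly, which makes the pile-up against the upper bound explicit:

$$ P(M_k \leq x) = P(X_1 \leq x, ..., X_k \leq x) = x^k \quad (0 \leq x \leq 1), \qquad \mathbb{E}[M_k] = \frac{k}{k+1} \xrightarrow{k \to \infty} 1 $$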


In [3]:
normal = stats.norm()
# note: scipy's genextreme uses the shape parameter c = -xi
gumbel = stats.genextreme(c=0)    # xi = 0,  Type 1 (exponential tail)
frechet = stats.genextreme(c=-1)  # xi = 1,  Type 2 (heavy tail)
weibull = stats.genextreme(c=1)   # xi = -1, Type 3 (bounded above)

In [4]:
fig, ax = plt.subplots(1,2, sharex=True, sharey=True, figsize=(15, 6))

x = np.arange(-3, 5.001, 0.001)

ax[0].plot(x, normal.pdf(x), label="normal", lw=1)
ax[0].plot(x, gumbel.pdf(x), label="gumbel", lw=1)
ax[0].plot(x, frechet.pdf(x), label="frechet", lw=1)
ax[0].plot(x, weibull.pdf(x), label="weibull", lw=1)
ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[0].set_ylabel("$P(X=x)$")
ax[0].set_xlabel("$x$")

ax[1].plot(x, normal.cdf(x), label="normal", lw=1)
ax[1].plot(x, gumbel.cdf(x), label="gumbel", lw=1)
ax[1].plot(x, frechet.cdf(x), label="frechet", lw=1)
ax[1].plot(x, weibull.cdf(x), label="weibull", lw=1)
ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
ax[1].set_ylabel("$P(X\leq x)$")
ax[1].set_xlabel("$x$")


lgd = fig.legend(
    *ax[1].get_legend_handles_labels(), 
    bbox_to_anchor=(0.8, 1.05), 
    bbox_transform=fig.transFigure,
    ncol=4
)

for lh in lgd.legendHandles:
    lh.set_linewidth(5)


sns.despine(offset=10)
fig.tight_layout()
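
So far we have only looked at simulated data. In the rest of this post, the same idea is used on the iris dataset as a simple unknown-class detector: compute the per-class mean of the features for a set of "known" classes, measure every point's squared distance to its nearest known mean, fit a generalized extreme value distribution to the errors of the known classes, and flag points whose CDF under that fitted distribution is very high as likely members of an unknown class. A minimal sketch of the final decision rule (the helper name and the default threshold are illustrative; the cells below use 0.9 in the plots and 0.95 in the scatter plots):

def is_unknown(error, extreme_dist, delta=0.95):
    # flag a point as "unknown" when its error is extreme under the fitted GEV
    return extreme_dist.cdf(error) > delta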



In [5]:
iris = sns.load_dataset("iris")
iris.head()


Out[5]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [6]:
g = sns.pairplot(iris, hue="species", markers=["o", "s", "D"])



In [7]:
g = sns.pairplot(iris[["petal_length", "petal_width", "species"]], hue="species", markers=["o", "s", "D"])



In [8]:
X = iris[["petal_length", "petal_width"]]
y = iris["species"]

In [9]:
def get_known_mean(X, y, known_classes):
    known_idx = y.isin(known_classes)
    known_mean = X[known_idx].groupby(y[known_idx]).mean()
    return known_mean

def plot_known_mean(X, y, known_mean):
    ax = sns.scatterplot(data=X, x="petal_length", y="petal_width", hue=y, alpha=0.5)
    ax.plot(known_mean["petal_length"], known_mean["petal_width"], marker="*", linestyle="none", ms=20, color="k", alpha=0.9)
    sns.despine(offset=10)
    return ax

In [10]:
known_classes = ["setosa", "versicolor"]
known_mean = get_known_mean(X, y, known_classes)
known_mean


Out[10]:
petal_length petal_width
species
setosa 1.462 0.246
versicolor 4.260 1.326

In [11]:
plot_known_mean(X, y, known_mean)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50f8d59e48>

In [12]:
def get_errors(X, y, known_mean):
    # for every point, the squared Euclidean distance to each known class mean,
    # keeping only the distance to the nearest known mean;
    # returned as a dict of Series keyed by the true class
    errors = {
        k: pd.concat([
            np.square(X[y==k] - known_mean.loc[cls]).sum(axis=1)
            for cls in known_mean.index
        ], axis=1).min(axis=1)
        for k in y.unique()
    }
    return errors

def get_error_bins(errors, nbins=40):
    # common histogram bins spanning the errors of all classes
    tmp = np.hstack(list(errors.values()))
    print(tmp.shape)
    bins = np.histogram(tmp, bins=nbins)[1]
    print(bins.min(), bins.max())
    return bins

In [13]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)


(150,)
0.0022759999999999924 7.918276000000003

In [14]:
errors["setosa"].shape


Out[14]:
(50,)

In [15]:
def fit_distributions(errors, known_mean):
    # fit both a GEV and a normal distribution to the errors of the known classes only
    known_errors = pd.concat([errors[cls] for cls in known_mean.index])
    c, loc, scale = stats.genextreme.fit(
        known_errors,
        loc=known_errors.mean(),
        scale=known_errors.std()
    )
    print(f"c={c}, loc={loc}, scale={scale}")
    extreme_dist = stats.genextreme(c, loc, scale)
    print(extreme_dist.stats("mvsk"))
    loc, scale = stats.norm.fit(known_errors)
    print(f"loc={loc}, scale={scale}")
    norm_dist = stats.norm(loc, scale)
    print(norm_dist.stats("mvsk"))
    return extreme_dist, norm_dist

In [16]:
extreme_dist, norm_dist = fit_distributions(errors, known_mean)


c=-1.538944057044235, loc=0.019979324415496713, scale=0.030411073884826514
(array(nan), array(nan), array(nan), array(nan))
loc=0.147582, scale=0.2558345446025614
(array(0.147582), array(0.06545131), array(0.), array(0.))
/home/napsternxg/miniconda3/lib/python3.7/site-packages/scipy/stats/_continuous_distns.py:2407: RuntimeWarning: invalid value encountered in power
  np.sign(c)*(-g3 + (g2 + 2*g2mg12)*g1)/g2mg12**1.5,

In [17]:
def plot_distributions(extreme_dist, norm_dist, bins):
    fig, ax = plt.subplots(1,2, sharex=True, sharey=False, figsize=(8, 3))
    lower = bins[0]
    #higher = max([extreme_dist.mean() + 2*extreme_dist.std(), norm_dist.mean() + 2*norm_dist.std()])
    higher = bins[-1]
    x = np.linspace(lower, higher, 1000)
    
    ax[0].plot(x, norm_dist.pdf(x), label="normal", lw=10, alpha=0.5)
    ax[0].plot(x, extreme_dist.pdf(x), label="extreme", lw=1)
    #ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[0].set_ylabel("$P(X=x)$")
    ax[0].set_xlabel("$x$")

    ax[1].plot(x, norm_dist.cdf(x), label="normal", lw=10, alpha=0.5)
    ax[1].plot(x, extreme_dist.cdf(x), label="extreme", lw=1)
    #ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
    ax[1].set_ylabel("$P(X\leq x)$")
    ax[1].set_xlabel("$x$")

    lgd = fig.legend(
        *ax[1].get_legend_handles_labels(), 
        bbox_to_anchor=(0.8, 1.05), 
        bbox_transform=fig.transFigure,
        ncol=4
    )
    for lh in lgd.legendHandles:
        lh.set_linewidth(5)
    sns.despine(offset=10)
    fig.tight_layout()

In [18]:
plot_distributions(extreme_dist, norm_dist, bins)



In [19]:
def plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins):
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = round(len(all_classes)/ncols)
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True, 
        figsize=(5*ncols, 4*nrows)
    )
    x = np.arange(bins[0], bins[-1]+0.01, 0.01)
    threshhold_cdf = 0.9
    threshhold_x = extreme_dist.ppf(threshhold_cdf)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        axi.hist(
            x_err, bins, cumulative=True, density=True,
            #weights=np.ones_like(errors)/np.shape(errors)[0], 
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.plot(x, extreme_dist.cdf(x), lw=1, color="k", label="extreme")
        axi.plot(x, norm_dist.cdf(x), lw=10, color="b", alpha=0.1, label="normal")
        axi.axhline(y=threshhold_cdf, lw=1, linestyle="--", color="0.5")
        axi.axvline(x=threshhold_x, lw=1, linestyle="--", color="0.5")
        axi.set_title(f"Mean[{cls}]={np.mean(x_err):.4f}", fontsize=20)
    axi.legend(loc="best")
    sns.despine(offset=10)
    fig.tight_layout()

In [20]:
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)



In [21]:
def plot_cdf_cdf(errors, known_classes, extreme_dist):
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = round(len(all_classes)/ncols)
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True, 
        figsize=(5*ncols, 4*nrows)
    )
    bins = np.arange(0, 1.01, 0.01)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        errors_cdf = extreme_dist.cdf(x_err)
        axi.hist(
            errors_cdf, bins, cumulative=True, density=True,
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.set_title(f"Mean[{cls}]={np.mean(errors_cdf):.4f}", fontsize=20)
    sns.despine(offset=10)
    fig.tight_layout()

In [22]:
plot_cdf_cdf(errors, known_classes, extreme_dist)



In [23]:
def scatter_known_unknown(X, y, errors, extreme_dist):
    fig = plt.figure()
    known_unknown = pd.concat(errors.values()).apply(extreme_dist.cdf)
    print((known_unknown > 0.95).value_counts())
    points = plt.scatter(
        data=X, 
        x="petal_length", 
        y="petal_width",
        c=known_unknown, 
        s=40, 
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)
    
def scatter_error(X, y, errors):
    fig = plt.figure()
    points = plt.scatter(
        data=X, 
        x="petal_length", 
        y="petal_width",
        c=pd.concat(errors.values()), 
        s=40, 
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)

In [24]:
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)


False    123
True      27
dtype: int64
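
As a quick follow-up (not in the original notebook), one could cross-tabulate the flagged points against the true species; with setosa and versicolor as the known classes, most of the flagged points should come from virginica:

flagged = pd.concat(errors.values()).apply(extreme_dist.cdf) > 0.95
print(pd.crosstab(y.loc[flagged.index], flagged))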

Let us try the same procedure with a different unknown class: this time setosa and virginica are the known classes, so versicolor is the unknown one.


In [25]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean


Out[25]:
petal_length petal_width
species
setosa 1.462 0.246
virginica 5.552 2.026

In [26]:
plot_known_mean(X, y, known_mean)
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)


(150,)
0.003559999999999977 4.721959999999999
c=-1.7429880648379807, loc=0.02479813176141388, scale=0.04106729563327325
(array(nan), array(nan), array(nan), array(nan))
loc=0.20643, scale=0.32417747006847975
(array(0.20643), array(0.10509103), array(0.), array(0.))
False    146
True       4
dtype: int64

Try the same procedure with all four iris features.


In [27]:
X = iris[["petal_length", "petal_width", "sepal_length", "sepal_width"]]
y = iris["species"]

In [28]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean


Out[28]:
petal_length petal_width sepal_length sepal_width
species
setosa 1.462 0.246 5.006 3.428
virginica 5.552 2.026 6.588 2.974

In [29]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)


(150,)
0.00438000000000002 6.761180000000001
c=-0.7443015169336825, loc=0.1973869250176823, scale=0.2145743274723953
(array(0.929725), array(nan), array(nan), array(nan))
loc=0.5868100000000002, scale=0.7672856861039443
(array(0.58681), array(0.58872732), array(0.), array(0.))

Experiment with the Olivetti faces data


In [30]:
from sklearn import datasets

In [31]:
faces = datasets.fetch_olivetti_faces()

In [32]:
print(faces.DESCR)


.. _olivetti_faces_dataset:

The Olivetti faces dataset
--------------------------

`This dataset contains a set of face images`_ taken between April 1992 and 
April 1994 at AT&T Laboratories Cambridge. The
:func:`sklearn.datasets.fetch_olivetti_faces` function is the data
fetching / caching function that downloads the data
archive from AT&T.

.. _This dataset contains a set of face images: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

As described on the original website:

    There are ten different images of each of 40 distinct subjects. For some
    subjects, the images were taken at different times, varying the lighting,
    facial expressions (open / closed eyes, smiling / not smiling) and facial
    details (glasses / no glasses). All the images were taken against a dark
    homogeneous background with the subjects in an upright, frontal position 
    (with tolerance for some side movement).

**Data Set Characteristics:**

    =================   =====================
    Classes                                40
    Samples total                         400
    Dimensionality                       4096
    Features            real, between 0 and 1
    =================   =====================

The image is quantized to 256 grey levels and stored as unsigned 8-bit 
integers; the loader will convert these to floating point values on the 
interval [0, 1], which are easier to work with for many algorithms.

The "target" for this database is an integer from 0 to 39 indicating the
identity of the person pictured; however, with only 10 examples per class, this
relatively small dataset is more interesting from an unsupervised or
semi-supervised perspective.

The original dataset consisted of 92 x 112, while the version available here
consists of 64x64 images.

When using these images, please give credit to AT&T Laboratories Cambridge.


In [33]:
faces.data.shape


Out[33]:
(400, 4096)

In [34]:
faces.target.shape


Out[34]:
(400,)

In [35]:
faces = pd.DataFrame(faces.data, columns=[f"f{i}" for i in range(faces.data.shape[1])]).assign(label=faces.target)
faces.head()


Out[35]:
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 ... f4087 f4088 f4089 f4090 f4091 f4092 f4093 f4094 f4095 label
0 0.309917 0.367769 0.417355 0.442149 0.528926 0.607438 0.657025 0.677686 0.690083 0.685950 ... 0.669421 0.652893 0.661157 0.475207 0.132231 0.148760 0.152893 0.161157 0.157025 0
1 0.454545 0.471074 0.512397 0.557851 0.595041 0.640496 0.681818 0.702479 0.710744 0.702479 ... 0.157025 0.136364 0.148760 0.152893 0.152893 0.152893 0.152893 0.152893 0.152893 0
2 0.318182 0.400826 0.491736 0.528926 0.586777 0.657025 0.681818 0.685950 0.702479 0.698347 ... 0.132231 0.181818 0.136364 0.128099 0.148760 0.144628 0.140496 0.148760 0.152893 0
3 0.198347 0.194215 0.194215 0.194215 0.190083 0.190083 0.243802 0.404959 0.483471 0.516529 ... 0.636364 0.657025 0.685950 0.727273 0.743802 0.764463 0.752066 0.752066 0.739669 0
4 0.500000 0.545455 0.582645 0.623967 0.648760 0.690083 0.694215 0.714876 0.723140 0.731405 ... 0.161157 0.177686 0.173554 0.177686 0.177686 0.177686 0.177686 0.173554 0.173554 0

5 rows × 4097 columns

Let us limit ourselves to only 16 classes (labels 0 to 15), of which the first 10 are treated as known.


In [36]:
X = faces[faces.label < 16].filter(regex="^f")
y = faces[faces.label < 16]["label"]
X.shape, y.shape, y.unique().shape


Out[36]:
((160, 4096), (160,), (16,))

In [37]:
known_classes = list(range(10))
known_mean = get_known_mean(X, y, known_classes)
known_mean


Out[37]:
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 ... f4086 f4087 f4088 f4089 f4090 f4091 f4092 f4093 f4094 f4095
label
0 0.341322 0.375620 0.416942 0.454959 0.497934 0.544628 0.583884 0.619008 0.643802 0.655785 ... 0.398347 0.390083 0.383884 0.347934 0.335124 0.301240 0.278512 0.275207 0.277686 0.276860
1 0.623967 0.645868 0.677686 0.697934 0.703306 0.707851 0.707438 0.713223 0.712810 0.708678 ... 0.414050 0.354132 0.294215 0.248347 0.206612 0.211157 0.200826 0.147934 0.125620 0.116116
2 0.378926 0.395041 0.419008 0.462810 0.504959 0.552066 0.602066 0.666116 0.713636 0.737603 ... 0.297107 0.306612 0.309504 0.307851 0.303306 0.303719 0.290909 0.292975 0.274793 0.268182
3 0.436777 0.478926 0.489669 0.500413 0.539669 0.571074 0.632231 0.691736 0.719008 0.738430 ... 0.373554 0.337190 0.344628 0.348347 0.352066 0.348760 0.319008 0.309917 0.297107 0.284711
4 0.496694 0.515289 0.565289 0.607851 0.647107 0.672314 0.689669 0.699587 0.707025 0.715289 ... 0.442562 0.443802 0.448760 0.450413 0.442149 0.424380 0.428512 0.411983 0.399587 0.395455
5 0.553719 0.585950 0.609091 0.629752 0.631405 0.676859 0.701653 0.729339 0.756612 0.783058 ... 0.608678 0.622314 0.575620 0.594215 0.635124 0.582645 0.599174 0.621488 0.590083 0.576859
6 0.237190 0.250413 0.310331 0.369835 0.471901 0.521901 0.562810 0.592975 0.628926 0.692149 ... 0.487190 0.489256 0.443802 0.446281 0.420661 0.400413 0.360331 0.329752 0.279752 0.252066
7 0.411157 0.461157 0.525620 0.603719 0.664463 0.708264 0.736364 0.764463 0.792149 0.809091 ... 0.381405 0.375620 0.364050 0.348347 0.283471 0.235124 0.206198 0.237603 0.253306 0.260744
8 0.411157 0.430165 0.450826 0.482231 0.534711 0.580579 0.611570 0.638017 0.657025 0.655785 ... 0.362810 0.364876 0.361983 0.355372 0.354132 0.357025 0.346694 0.332645 0.342562 0.345868
9 0.399174 0.471074 0.549174 0.611570 0.651240 0.677273 0.703306 0.718595 0.732231 0.736364 ... 0.141736 0.133884 0.137190 0.132231 0.122727 0.128512 0.120248 0.131405 0.149587 0.160331

10 rows × 4096 columns
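
Since each row of known_mean is a flattened 64x64 image (per the dataset description above), we can reshape and plot the rows to eyeball the mean face of each known class. A small sketch, not part of the original notebook:

fig, axes = plt.subplots(1, len(known_mean), figsize=(2*len(known_mean), 2))
for axi, (label, row) in zip(axes, known_mean.iterrows()):
    axi.imshow(row.to_numpy().reshape(64, 64), cmap="gray")  # one mean face per known class
    axi.set_title(label)
    axi.axis("off")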


In [38]:
%time errors = get_errors(X, y, known_mean)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)


CPU times: user 3min 36s, sys: 31.2 ms, total: 3min 36s
Wall time: 3min 39s
c=-0.12452055920104532, loc=30.314841146660864, scale=14.255744567468316
(array(40.53064726), array(494.32552816), array(2.18340756), array(11.09063013))
loc=40.56266403198242, scale=22.55886459350586
(array(40.56266403), array(508.90237427), array(0.), array(0.))
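
The pandas-based get_errors loops over classes and concatenates intermediate frames, which is why it takes minutes on the 4096-dimensional face vectors. A possible vectorized alternative (a sketch, not the original implementation; get_errors_fast is a hypothetical name) computes the same nearest-known-mean squared errors with scipy's cdist:

from scipy.spatial.distance import cdist

def get_errors_fast(X, y, known_mean):
    # squared Euclidean distance from every point to every known class mean,
    # keeping only the distance to the nearest known mean (same quantity as get_errors)
    d2 = pd.Series(
        cdist(X.to_numpy(), known_mean.to_numpy(), metric="sqeuclidean").min(axis=1),
        index=X.index,
    )
    # return the same dict-of-Series structure keyed by true class
    return {cls: d2[y == cls] for cls in y.unique()}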

In [39]:
bins = get_error_bins(errors, nbins=40)
%time plot_distributions(extreme_dist, norm_dist, bins)
%time plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
%time plot_cdf_cdf(errors, known_classes, extreme_dist)


(160,)
10.425526 134.32831
CPU times: user 172 ms, sys: 15.6 ms, total: 188 ms
Wall time: 176 ms
CPU times: user 1.86 s, sys: 78.1 ms, total: 1.94 s
Wall time: 1.93 s
CPU times: user 1.78 s, sys: 31.2 ms, total: 1.81 s
Wall time: 1.82 s

In [ ]: